Given independent samples from $P$ and $Q$, two-sample permutation tests
allow one to construct exact level tests when the null hypothesis is
$P = Q$. On the other hand, when comparing or testing particular
parameters $\theta$ of $P$ and $Q$, such as their means or medians,
permutation tests need not be level $\alpha$, or even approximately level
$\alpha$ in large samples. Under very weak assumptions for comparing
estimators, we provide a general test procedure whereby the asymptotic
validity of the permutation test holds while retaining the exact rejection
probability $\alpha$ in finite samples when the underlying distributions
are identical. The ideas are broadly applicable and special attention is
given to the $k$-sample problem of comparing general parameters, whereby a
permutation test is constructed which is exact level $\alpha$ under the
hypothesis of identical distributions, but has asymptotic rejection
probability $\alpha$ under the more general null hypothesis of equality of
parameters. A Monte Carlo simulation study is performed as well. A quite
general theory is possible based on a coupling construction, as well as a
key contiguity argument for the multinomial and multivariate
hypergeometric distributions.

1. Introduction. In this article, we consider the behavior of two-sample
(and later also $k$-sample) permutation tests for testing problems when the
fundamental assumption of identical distributions need not hold. Assume
$X_1, \ldots, X_m$ are i.i.d. according to a probability distribution $P$,
and independently, $Y_1, \ldots, Y_n$ are i.i.d. $Q$. The underlying model
specifies a family of pairs of distributions $(P, Q)$ in some space
$\Omega$. For the problems considered here, $\Omega$ specifies a
nonparametric model, such as the set of all pairs of distributions. Let
$N = m + n$, and write

(1.1)  $Z = (Z_1, \ldots, Z_N) = (X_1, \ldots, X_m, Y_1, \ldots, Y_n).$



Received July 2012; revised December 2012. 

Supported by NSF Grant DMS-07-07085. 

AMS 2000 subject classifications. Primary 62E20; secondary 62G10. 

Key words and phrases. Behrens-Fisher problem, coupling, permutation test. 



This is an electronic reprint of the original article published by the 

Institute of Mathematical Statistics in The Annals of Statistics, 

2013, Vol. 41, No. 2, 484-507. This reprint differs from the original in pagination 

and typographic detail. 




E. CHUNG AND J. P. ROMANO

Let $\bar\Omega = \{(P, Q) : P = Q\}$. Under the assumption
$(P, Q) \in \bar\Omega$, the joint distribution of $(Z_1, \ldots, Z_N)$ is
the same as that of $(Z_{\pi(1)}, \ldots, Z_{\pi(N)})$, where
$(\pi(1), \ldots, \pi(N))$ is any permutation of $\{1, \ldots, N\}$. It
follows that, when testing any null hypothesis $H_0 : (P, Q) \in \Omega_0$,
where $\Omega_0 \subset \bar\Omega$, an exact level $\alpha$ test can be
constructed by a permutation test. To review how, let $G_N$ denote the set
of all permutations $\pi$ of $\{1, \ldots, N\}$. Then, given any test
statistic $T_{m,n} = T_{m,n}(Z_1, \ldots, Z_N)$, recompute $T_{m,n}$ for all
permutations $\pi$; that is, compute
$T_{m,n}(Z_{\pi(1)}, \ldots, Z_{\pi(N)})$ for all $\pi \in G_N$, and let
their ordered values be

    $T^{(1)}_{m,n} \le T^{(2)}_{m,n} \le \cdots \le T^{(N!)}_{m,n}.$

Fix a nominal level $\alpha$, $0 < \alpha < 1$, and let $k$ be defined by
$k = N! - [\alpha N!]$, where $[\alpha N!]$ denotes the largest integer less
than or equal to $\alpha N!$. Let $M^+(z)$ and $M^0(z)$ be the number of
values $T^{(j)}_{m,n}(z)$ ($j = 1, \ldots, N!$) which are greater than
$T^{(k)}_{m,n}(z)$ and equal to $T^{(k)}_{m,n}(z)$, respectively. Set

    $a(z) = \frac{\alpha N! - M^+(z)}{M^0(z)}.$

Define the randomization test function $\phi(Z)$ to be equal to 1, $a(Z)$
or 0 according to whether $T_{m,n}(Z) > T^{(k)}_{m,n}(Z)$,
$T_{m,n}(Z) = T^{(k)}_{m,n}(Z)$ or $T_{m,n}(Z) < T^{(k)}_{m,n}(Z)$,
respectively. Then, under any $(P, Q) \in \bar\Omega$,

    $E_{P,Q}[\phi(X_1, \ldots, X_m, Y_1, \ldots, Y_n)] = \alpha.$

Also, define the permutation distribution as

(1.2)  $\hat R^T_{m,n}(t) = \frac{1}{N!} \sum_{\pi \in G_N} I\{T_{m,n}(Z_{\pi(1)}, \ldots, Z_{\pi(N)}) \le t\}.$

Roughly speaking (after accounting for discreteness), the permutation test
rejects $H_0$ if the test statistic $T_{m,n}$ exceeds $T^{(k)}_{m,n}$, or a
$1 - \alpha$ quantile of this permutation distribution.
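
The construction above can be sketched in code for very small $N$, where
all $N!$ permutations can be enumerated. The following Python sketch is
illustrative only: the names `scaled_mean_difference` and
`randomization_test` are hypothetical, and the statistic is assumed to be
an integer multiple of the difference of sample means so that ties are
detected in exact arithmetic.

```python
from fractions import Fraction
from itertools import permutations
from math import floor


def scaled_mean_difference(x, y):
    # n * sum(x) - m * sum(y) = m * n * (mean(x) - mean(y)): a positive
    # multiple of the difference of sample means, kept in exact arithmetic.
    return len(y) * sum(x) - len(x) * sum(y)


def randomization_test(z, m, alpha, stat=scaled_mean_difference):
    """Return phi(z): 1 (reject), 0 (accept) or the randomization value a(z)."""
    values = sorted(stat(p[:m], p[m:]) for p in permutations(z))
    n_perm = len(values)                      # N!
    k = n_perm - floor(alpha * n_perm)        # k = N! - [alpha N!]
    t_k = values[k - 1]                       # k-th smallest value, T^(k)
    m_plus = sum(v > t_k for v in values)     # M^+(z)
    m_zero = sum(v == t_k for v in values)    # M^0(z)
    a = (alpha * n_perm - m_plus) / m_zero    # a(z)
    t_obs = stat(z[:m], z[m:])
    if t_obs > t_k:
        return Fraction(1)
    if t_obs == t_k:
        return Fraction(a)
    return Fraction(0)


if __name__ == "__main__":
    z = (1, 2, 3, 5, 8)                       # first m = 2 entries play the role of X
    alpha = Fraction(1, 20)
    # Averaging phi over all rearrangements of the data returns alpha exactly.
    average = sum(randomization_test(p, 2, alpha) for p in permutations(z)) / 120
    print(average)
```

Averaging $\phi$ over all rearrangements of a fixed data set gives exactly
$\alpha$ by construction, which is the mechanism behind the exactness of
the test when $P = Q$.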

It may be helpful to consider an alternative description of the
permutation distribution given in (1.2). As a shorthand, for any
$\pi \in G_N$, let $Z_\pi = (Z_{\pi(1)}, \ldots, Z_{\pi(N)})$. Let $\Pi$
denote a random permutation, uniformly distributed over $G_N$. Then,
$T_{m,n}(Z_\Pi)$ denotes the random variable that evaluates the test
statistic, not at the original data $Z$, but at a randomly permuted data
set $Z_\Pi$. The permutation distribution $\hat R^T_{m,n}(\cdot)$ given in
(1.2) is evidently the conditional distribution of $T_{m,n}(Z_\Pi)$ given
$Z$, because conditional on the data $Z$, $T_{m,n}(Z_\Pi)$ is equally
likely to be any of $T_{m,n}(Z_\pi)$ among $\pi \in G_N$. The asymptotic
behavior of this (conditional) distribution $\hat R^T_{m,n}(\cdot)$ is the
key to establishing properties of the permutation test.

Although the rejection probability of the permutation test is exactly
$\alpha$ when $P = Q$, problems arise if $\Omega_0$ is strictly bigger than
$\bar\Omega$. Since a transformed permuted data set no longer has the same
distribution as the original data set, the argument leading to the
construction of an $\alpha$ level test fails, and faulty inferences can
occur.

To be concrete, if we are interested in testing equality of means, for
example, then $\Omega_0 = \{(P, Q) : \mu(P) = \mu(Q)\}$ which, of course, is
strictly bigger than $\bar\Omega$. So, consider constructing a permutation
test based on the difference of sample means

(1.3)  $T_{m,n} = \sqrt{N}(\bar X_m - \bar Y_n).$

Note that we are not taking the absolute difference, so that the test is
one-sided, as we are rejecting for large positive values of the difference.
First of all, we are not concerned about testing
$\bar\Omega = \{(P, Q) : P = Q\}$, but something bigger than $\bar\Omega$.
However, we underscore the point that a test statistic (1.3) is not
appropriate for testing $\bar\Omega$ without further assumptions because
the test clearly will not have any power against distributions $P$ and $Q$
whose means are identical but $P \ne Q$.

The permutation test based on the difference of sample means is only 
appropriate as a test of equality of population means. However, the permu- 
tation test no longer controls the level of the test, even in large samples. 
As is well known (Romano [23]), the permutation test possesses a certain 
asymptotic robustness as a test of difference in means if $m/n \to 1$ as $n \to \infty$,
or the underlying variances of P and Q are equal, in the sense that the re- 
jection probability under the null hypothesis of equal means tends to the 
nominal level. Without equal variances or comparable sample sizes, the re- 
jection probability can be much larger than the nominal level, which is a 
concern. Because of the lack of robustness and the increased probability of a 
type 1 error, rejection of the null may incorrectly be interpreted as rejection 
of equal means, when in fact it is caused by unequal variances and unequal 
sample sizes. Even more alarming is the possibility of rejecting a two-sided 
null hypothesis when observing a positive large difference with the
accompanying inference that the mean difference is positive when in fact
the difference in means is negative, a type 3 error or directional error.
Indeed, if for some $P$ and $Q$ with equal means the rejection probability
is, say, $\gamma \gg \alpha$, then it follows by continuity that the
rejection probability under some $P$ and $Q$ with negative mean difference
will be nearly $\gamma$ as well, where one would conclude
that the mean difference is actually positive. Further note that there is
also the possibility that the rejection probability can be much less than the 
nominal level, which by continuity implies the test is biased and has little 
power of detecting a true difference in means, or large type 2 error. 

The situation is even worse when basing a test on a difference in sample
medians, in the sense that regardless of sample sizes, the asymptotic
rejection probability of the permutation test will be $\alpha$ only under
very stringent conditions, which essentially means only in the case where
the underlying distributions are the same.

However, in a very insightful paper in the context of random censoring
models, Neuhaus [18] realized that by proper studentization of a test statis- 
tic, the permutation test can result in asymptotically valid inference even 
when the underlying distributions are not the same. This result has been 
extended to other specific problems, such as comparing means by Janssen [9] 
and certain linear statistics in Janssen [10] (including the Wilcoxon statistic 
without ties), variances by Pauly [20] and the two-sample Wilcoxon test by 
Neubert and Brunner [17] (where ties are allowed). Other results on permu- 
tation tests are presented in Janssen [11], Janssen and Pauls [12], Janssen 
and Pauls [13] and Janssen and Pauly [14]. The recent paper by Omelka 
and Pauly [19] compares correlations by permutation tests, which is a spe- 
cial case of our general results. Note that the importance of studentization 
when bootstrapping is well known; see Hall and Wilson [7] and Delaigle et 
al. [3] (though its role for bootstrap is to obtain higher order accuracy while 
in the context here first order accuracy can fail without studentization). 

The goal of this paper is to obtain a quite general result of the same
phenomenon. That is, when basing a permutation test using some test
statistic as a test of a parameter (usually a difference of parameters
associated with marginal distributions), we would like to retain the
exactness property when $P = Q$, and also have the asymptotic rejection
probability be $\alpha$ for the more general null hypothesis specifying the
parameter (such as the difference being zero). Of course, there are many
alternatives to getting asymptotic tests, such as the bootstrap or
subsampling. However, we do not wish to give up the exactness property
under $P = Q$, and resampling methods do not have such finite sample
guarantees. The main problem becomes: what is the asymptotic behavior of
$\hat R^T_{m,n}(\cdot)$ defined in (1.2) for general test statistic
sequences $T_{m,n}$ when the underlying distributions differ? Only for
suitable test statistics is it possible both to achieve finite sample
exactness when the underlying distributions are equal and to maintain a
large sample rejection probability near the nominal level when the
underlying distributions need not be equal. In this sense, our results are
both exact and asymptotically robust for heterogeneous populations.

This paper provides a framework for testing a parameter that depends on
$P$ and $Q$ (and later on $k$ underlying distributions $P_i$ for
$i = 1, \ldots, k$). We construct a general test procedure where the
asymptotic validity of the permutation test holds in a general setting.
Assuming that estimators are asymptotically linear and consistent
estimators are available for their asymptotic variance, we provide in
Section 2 a test that has asymptotic rejection probability equal to the
nominal level $\alpha$, but still retains the exact rejection probability
of $\alpha$ in finite samples if $P = Q$. It is not even required that the
estimators are based on differentiable functionals, and some methods like
the bootstrap would not necessarily be even asymptotically valid under such
conditions, let alone retain the finite sample exactness property when
$P = Q$. In Section 3, generalizations of the results are discussed, with
special attention to the more general $k$-sample problem of comparing
general parameters. Furthermore, Monte Carlo simulation studies
illustrating our results are presented in Section 4. The arguments of the
paper are quite different from Janssen and previous authors, and hold under
great generality. For example, they immediately apply to comparing means,
variances or medians. The key idea is to show that the permutation
distribution behaves like the unconditional distribution of the test
statistic when all $N$ observations are i.i.d. from the mixture
distribution $pP + (1 - p)Q$, where $p$ is such that $m/N \to p$. This seems
intuitive because the permutation distribution permutes the observations so
that a permuted sample is almost like a sample from the mixture
distribution. In order to make this idea precise, a coupling argument is
given in Section 5.3. Of course, the permutation distribution depends on
all permuted samples (for a given original data set). But even for one
permuted data set, it cannot exactly be viewed as a sample from
$pP + (1 - p)Q$. Indeed, the first $m$ observations from the mixture would
include $B_m$ observations from $P$ and the rest from $Q$, where $B_m$ has
the binomial distribution based on $m$ trials and success probability $p$.
On the other hand, for a permuted sample, if $H_m$ denotes the number of
observations from $P$ among its first $m$ entries, then $H_m$ has the
hypergeometric distribution with mean $mp$. The key argument that allows
for such a general result concerns the contiguity of the distributions of
$B_m$ and $H_m$. Section 5 highlights the main technical ideas required for
the proofs. All proofs are deferred to the supplementary appendix [2].
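
To make the comparison between $B_m$ and $H_m$ concrete, the following
Python sketch computes both probability mass functions exactly for a small
example. The function names and the sample sizes are illustrative
assumptions; the hypergeometric law is taken to be that of the number of
original $X$'s among the first $m$ positions of a uniformly permuted
sample, and the binomial success probability is set to $p = m/N$.

```python
from math import comb


def binomial_pmf(m, p, j):
    # P(B_m = j) for B_m ~ Binomial(m, p).
    return comb(m, j) * p**j * (1 - p) ** (m - j)


def hypergeometric_pmf(m, n, j):
    # P(H_m = j): j of the first m positions of a random permutation are
    # filled by the m original X's, the remaining m - j by the n original Y's.
    if j < 0 or j > m or m - j > n:
        return 0.0
    return comb(m, j) * comb(n, m - j) / comb(m + n, m)


def mean_and_variance(pmf, support):
    mean = sum(j * pmf(j) for j in support)
    var = sum((j - mean) ** 2 * pmf(j) for j in support)
    return mean, var


if __name__ == "__main__":
    m, n = 6, 9                                # assumed sample sizes
    p = m / (m + n)
    support = range(m + 1)
    print(mean_and_variance(lambda j: binomial_pmf(m, p, j), support))
    print(mean_and_variance(lambda j: hypergeometric_pmf(m, n, j), support))
```

In this setting both laws have mean $mp$, while the hypergeometric variance
equals the binomial variance multiplied by the finite-population factor
$(N - m)/(N - 1)$, so the two distributions are close but not identical.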

2. Robust studentized two-sample test. In this section, we consider the
general problem of inference from the permutation distribution when
comparing parameters from two populations. Specifically, assume
$X_1, \ldots, X_m$ are i.i.d. $P$ and, independently, $Y_1, \ldots, Y_n$ are
i.i.d. $Q$. Let $\theta(\cdot)$ be a real-valued parameter, defined on some
space of distributions $\mathcal{P}$. The problem is to test the null
hypothesis

(2.1)  $H_0 : \theta(P) = \theta(Q).$

Of course, when $P = Q$, one can construct permutation tests with exact
level $\alpha$. Unfortunately, if $P \ne Q$, the test need not be valid in
the sense that the probability of a type 1 error need not be $\alpha$ even
asymptotically. Thus, our goal is to construct a procedure that has
asymptotic rejection probability equal to $\alpha$ quite generally, but
also retains the exactness property in finite samples when $P = Q$.

We will assume that estimators are available that are asymptotically
linear. Specifically, assume that, under $P$, there exists an estimator
$\hat\theta_m = \hat\theta_m(X_1, \ldots, X_m)$ which satisfies

(2.2)  $m^{1/2}[\hat\theta_m - \theta(P)] = \frac{1}{\sqrt{m}} \sum_{i=1}^m f_P(X_i) + o_P(1).$

Similarly, we assume that, based on the $Y_j$ (under $Q$),

(2.3)  $n^{1/2}[\hat\theta_n - \theta(Q)] = \frac{1}{\sqrt{n}} \sum_{j=1}^n f_Q(Y_j) + o_Q(1).$

The functions determining the linear approximation, $f_P$ and $f_Q$, can of
course depend on the underlying distributions. Different forms of
differentiability guarantee such linear expansions in the special case when
$\hat\theta_m$ takes the form of an empirical estimate $\theta(\hat P_m)$,
where $\hat P_m$ is the empirical measure constructed from
$X_1, \ldots, X_m$, but we will not need to assume such stronger
conditions. We will argue that our assumptions of asymptotic linearity
already imply a result about the permutation distribution corresponding to
the statistic
$N^{1/2}[\hat\theta_m(X_1, \ldots, X_m) - \hat\theta_n(Y_1, \ldots, Y_n)]$,
without having to impose any differentiability assumptions. However, we
will assume the expansion (2.2) holds not just for i.i.d. samples under $P$
and under $Q$, but also when sampling i.i.d. observations from the mixture
distribution $\bar P = pP + qQ$, where $q = 1 - p$. This is a weak
assumption and replaces having to study the permutation distribution based
on variables that are no longer independent nor identically distributed
with a simple assumption about the behavior under an i.i.d. sequence.
Indeed, we will argue that in all cases, the permutation distribution
behaves asymptotically like the unconditional limiting sampling
distribution of the studied statistic sequence when sampling i.i.d.
observations from $\bar P$.

In the next two theorems, the behavior of the permutation distribution is
obtained. Note that it is not assumed that the null hypothesis
$\theta(P) = \theta(Q)$ necessarily holds. Indeed, the asymptotic behavior
of the permutation distribution under $P$ and $Q$ is the same as when all
observations are from the mixture distribution $\bar P = pP + (1 - p)Q$,
where $p = \lim m/N$. Proofs of all the results in Section 2 are presented
along with proofs of the results in Section 5 in the supplementary
appendix [2].

Theorem 2.1. Assume $X_1, \ldots, X_m$ are i.i.d. $P$ and, independently,
$Y_1, \ldots, Y_n$ are i.i.d. $Q$. Consider testing the null hypothesis
(2.1) based on a test statistic of the form

    $T_{m,n} = N^{1/2}[\hat\theta_m(X_1, \ldots, X_m) - \hat\theta_n(Y_1, \ldots, Y_n)],$

where the estimators satisfy (2.2) and (2.3). Further assume
$E_P f_P(X_i) = 0$ and

    $0 < E_P f_P^2(X_i) = \sigma^2(P) < \infty$

and the same with $P$ replaced by $Q$. Let $m \to \infty$, $n \to \infty$,
with $N = m + n$, $p_m = m/N$, $q_m = n/N$ and $p_m \to p \in (0, 1)$ with

(2.4)  $p_m - p = O(N^{-1/2}).$

Assume the estimator sequence also satisfies (2.2) with $P$ replaced by
$\bar P = pP + qQ$ with $\sigma^2(\bar P) < \infty$.

Then the permutation distribution of $T_{m,n}$ given by (1.2) satisfies

    $\sup_t |\hat R^T_{m,n}(t) - \Phi(t/\tau(\bar P))| \xrightarrow{P} 0,$

where $\Phi$ denotes the standard normal c.d.f. and

(2.5)  $\tau^2(\bar P) = \frac{1}{p(1 - p)} \sigma^2(\bar P).$

Remark 2.1. Under $H_0$ given by (2.1), the true unconditional sampling
distribution of $T_{m,n}$ is asymptotically normal with mean 0 and variance

(2.6)  $\frac{1}{p} \sigma^2(P) + \frac{1}{1 - p} \sigma^2(Q),$

which does not equal $\tau^2(\bar P)$ defined by (2.5) in general.

Example 2.1 (Difference of means). As is well known, even for the case
of comparing population means by sample means, under the null hypothesis
that $\theta(P) = \theta(Q)$, equality of (2.5) and (2.6) holds if and only
if $p = 1/2$ or $\sigma^2(P) = \sigma^2(Q)$.
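
The size of the resulting distortion can be illustrated numerically. The
sketch below is not taken from the paper: it assumes that the permutation
critical value converges to $\tau(\bar P) z_{1-\alpha}$ (as suggested by
Theorem 2.1) while $T_{m,n}$ is asymptotically normal with variance (2.6),
and it uses the fact that, for means under the null,
$\sigma^2(\bar P) = p\sigma^2(P) + (1 - p)\sigma^2(Q)$. The function name
and the inputs are hypothetical.

```python
from statistics import NormalDist


def limiting_rejection_probability(p, var_p, var_q, alpha=0.05):
    """Limit of the rejection probability of the unstudentized one-sided test."""
    # Variance (2.5) of the limiting permutation distribution for means.
    tau2 = (p * var_p + (1 - p) * var_q) / (p * (1 - p))
    # Variance (2.6) of the true limiting sampling distribution.
    true_var = var_p / p + var_q / (1 - p)
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z * (tau2 / true_var) ** 0.5)


if __name__ == "__main__":
    print(limiting_rejection_probability(0.5, 25.0, 1.0))  # balanced sample sizes
    print(limiting_rejection_probability(0.2, 25.0, 1.0))  # small sample, large variance
    print(limiting_rejection_probability(0.8, 25.0, 1.0))  # large sample, large variance
```

Under these assumptions the limit equals $\alpha$ when $p = 1/2$ or the
variances agree, exceeds $\alpha$ when the smaller sample comes from the
population with the larger variance, and falls below $\alpha$ in the
reverse case.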

Example 2.2 (Difference of medians). Let $F$ and $G$ denote the c.d.f.s
corresponding to $P$ and $Q$. Let $\theta(F)$ denote the median of $F$,
that is, $\theta(F) = \inf\{x : F(x) \ge 1/2\}$. Then it is well known
(Serfling [24]) that if $F$ is continuously differentiable at $\theta(P)$
with derivative $F'$ (and the same with $F$ replaced by $G$), then

    $m^{1/2}[\theta(\hat P_m) - \theta(P)] = \frac{1}{\sqrt{m}} \sum_{i=1}^m \frac{1/2 - I\{X_i \le \theta(P)\}}{F'(\theta(P))} + o_P(1)$

and similarly, with $P$ and $F$ replaced by $Q$ and $G$. Thus, we can apply
Theorem 2.1 and conclude that, when $\theta(P) = \theta(Q) = \theta$, the
permutation distribution of $T_{m,n}$ is approximately a normal
distribution with mean 0 and variance

    $\frac{1}{4p(1 - p)[pF'(\theta) + (1 - p)G'(\theta)]^2}$

in large samples. On the other hand, the true sampling distribution is
approximately a normal distribution with mean 0 and variance

(2.7)  $v^2(P, Q) = \frac{1}{p} \frac{1}{4[F'(\theta)]^2} + \frac{1}{1 - p} \frac{1}{4[G'(\theta)]^2}.$

Thus the permutation distribution and the true unconditional sampling
distribution behave differently asymptotically unless
$F'(\theta) = G'(\theta)$ is satisfied. Since we do not assume $P = Q$,
this condition is a strong assumption. Hence, the permutation test for
testing equality of medians is generally not valid in the sense that the
rejection probability tends to a value that is far from the nominal level
$\alpha$.
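
As a rough numerical illustration of this example, the following Python
sketch evaluates the two limiting variances and the implied limiting
rejection probability of the unstudentized one-sided test. It is a sketch
under stated assumptions: the function names are hypothetical, the
critical value is assumed to converge to the $1 - \alpha$ quantile of the
limiting permutation distribution, and the second population is taken to
be normal with standard deviation 5.

```python
from statistics import NormalDist


def median_limit_variances(p, f_density, g_density):
    """Return (limiting permutation variance, true variance (2.7))."""
    mix = p * f_density + (1 - p) * g_density
    perm_var = 1 / (4 * p * (1 - p) * mix**2)
    true_var = 1 / (4 * p * f_density**2) + 1 / (4 * (1 - p) * g_density**2)
    return perm_var, true_var


def median_limiting_rejection_probability(p, f_density, g_density, alpha=0.05):
    perm_var, true_var = median_limit_variances(p, f_density, g_density)
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z * (perm_var / true_var) ** 0.5)


if __name__ == "__main__":
    f0 = NormalDist(0, 1).pdf(0)  # F'(0) for P = N(0, 1)
    g0 = NormalDist(0, 5).pdf(0)  # G'(0) for Q normal with standard deviation 5 (assumed)
    print(median_limiting_rejection_probability(0.5, f0, g0))
```

With these assumed inputs the limit is about 0.22 even for balanced sample
sizes, which is of the same order as the unstudentized entries reported
for large balanced samples in Table 1.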

The main goal now is to show how studentizing the test statistic leads to 
a general correction. 

Theorem 2.2. Assume the setup and conditions of Theorem 2.1. Further
assume that $\hat\sigma_m(X_1, \ldots, X_m)$ is a consistent estimator of
$\sigma(P)$ when $X_1, \ldots, X_m$ are i.i.d. $P$. Assume consistency also
under $Q$ and $\bar P$, so that
$\hat\sigma_n(V_1, \ldots, V_n) \xrightarrow{P} \sigma(\bar P)$ as
$n \to \infty$ when the $V_i$ are i.i.d. $\bar P$. Define the studentized
test statistic

(2.8)  $S_{m,n} = \frac{T_{m,n}}{V_{m,n}},$

where

    $V_{m,n} = \sqrt{\frac{N}{m} \hat\sigma_m^2(X_1, \ldots, X_m) + \frac{N}{n} \hat\sigma_n^2(Y_1, \ldots, Y_n)}$

and consider the permutation distribution defined in (1.2) with $T$
replaced by $S$. Then

(2.9)  $\sup_t |\hat R^S_{m,n}(t) - \Phi(t)| \xrightarrow{P} 0.$

Thus the permutation distribution is asymptotically standard normal, as
is the true unconditional limiting distribution of the test statistic
$S_{m,n}$. Indeed, as mentioned in Remark 2.1, the true unconditional
limiting distribution of $T_{m,n}$ is normal with mean 0 and variance given
by (2.6). But, when sampling $m$ observations from $P$ and $n$ from $Q$,
$V^2_{m,n}$ tends in probability to (2.6), and hence the limiting
distribution of $S_{m,n} = T_{m,n}/V_{m,n}$ is standard normal, the same as
that of the permutation distribution.

Remark 2.2. As previously noted, Theorems 2.1 and 2.2 are true even if
$\theta(P) \ne \theta(Q)$. If $\theta(P) = \theta(Q)$, then the true
sampling distribution of $S_{m,n}$ and the permutation distribution become
approximately the same. However, if $\theta(P) > \theta(Q)$, then we get
the power tending to 1. Indeed, the critical value from the permutation
distribution asymptotically tends to a finite value $z_{1-\alpha}$ in
probability, while the test statistic tends to infinity in probability.
Also, see Remark 2.3 for local power.

Example 2.1 (Continued). As proved by Janssen [9], even when the
underlying distributions may have different variances and different sample
sizes, permutation tests based on studentized statistics

    $S_{m,n} = \frac{N^{1/2}(\bar X_m - \bar Y_n)}{\sqrt{N S_X^2/m + N S_Y^2/n}},$

where $S_X^2 = \frac{1}{m-1} \sum_{i=1}^m (X_i - \bar X_m)^2$ and
$S_Y^2 = \frac{1}{n-1} \sum_{j=1}^n (Y_j - \bar Y_n)^2$, can allow one to
construct a test that attains asymptotic rejection probability $\alpha$
when $P \ne Q$ while providing an additional advantage of maintaining exact
level $\alpha$ when $P = Q$.
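
A minimal Python sketch of such a studentized permutation test for means
is given below. It is illustrative rather than the authors' code: the
function names are hypothetical, permutations are sampled at random rather
than enumerated, and a common Monte Carlo p-value is reported in place of
the randomized test function $\phi$.

```python
import math
import random
from statistics import mean, variance


def studentized_mean_difference(x, y):
    """sqrt(N) (mean(x) - mean(y)) / sqrt(N S_X^2 / m + N S_Y^2 / n)."""
    m, n = len(x), len(y)
    big_n = m + n
    scale = (big_n * variance(x) / m + big_n * variance(y) / n) ** 0.5
    diff = big_n**0.5 * (mean(x) - mean(y))
    if scale == 0:
        # Degenerate split (both groups constant); handled by convention.
        return 0.0 if diff == 0 else math.copysign(float("inf"), diff)
    return diff / scale


def permutation_p_value(x, y, stat=studentized_mean_difference, n_perm=999, seed=0):
    """One-sided Monte Carlo permutation p-value (large values of stat reject)."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m = len(x)
    observed = stat(x, y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if stat(pooled[:m], pooled[m:]) >= observed:
            exceed += 1
    # The observed arrangement is counted among the n_perm + 1 arrangements.
    return (exceed + 1) / (n_perm + 1)
```

Passing a different `stat` (for example, an unstudentized difference of
means) to `permutation_p_value` allows the two versions of the test to be
compared on the same data.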

Example 2.2 (Continued). Define the studentized median statistic

    $S_{m,n} = \frac{N^{1/2}[\theta(\hat P_m) - \theta(\hat Q_n)]}{\hat v_{m,n}},$

where $\hat v_{m,n}$ is a consistent estimator of $v(P, Q)$ defined in
(2.7). There are several choices for a consistent estimator of $v(P, Q)$.
Examples include the usual kernel estimator (Devroye and Wagner [4]), the
bootstrap estimator (Efron [5]), and the smoothed bootstrap (Hall,
DiCiccio, and Romano [6]).

Remark 2.3. Suppose that the true unconditional distribution of a test
statistic $T_{m,n}$ is, under the null hypothesis, asymptotically given by
a distribution $R(\cdot)$. Typically a test rejects when
$T_{m,n} > r_{m,n}$, where $r_{m,n}$ is nonrandom, as happens in many
classical settings. Then, we typically have
$r_{m,n} \to r(1 - \alpha) \equiv R^{-1}(1 - \alpha)$. Assume that
$T_{m,n}$ converges to some limit law $R'(\cdot)$ under some sequence of
alternatives which are contiguous to some distribution satisfying the null.
Then, the power of the test against such a sequence would tend to
$1 - R'(r(1 - \alpha))$. The point here is that, under the conditions of
Theorem 2.2, the permutation test based on a random critical value
$\hat r_{m,n}$ obtained from the permutation distribution satisfies, under
the null, $\hat r_{m,n} \xrightarrow{P} r(1 - \alpha)$. But then, contiguity
implies the same behavior under a sequence of contiguous alternatives.
Thus, the permutation test has the same limiting local power as the
"classical" test which uses the nonrandom critical value. So, to first
order, there is no loss in power in using a permutation critical value. Of
course, there are big gains because the permutation test applies much more
broadly than for usual parametric models, in that it retains the level
exactly across a broad class of distributions and is at least
asymptotically justified for a large nonparametric family.

3. Generalizations.

3.1. Wilcoxon statistic and general U-statistics. So far, we considered 
two-sample problems where the statistic is based on the difference of es- 
timators that are asymptotically linear. Although this class of estimators 
includes many interesting cases such as testing equality of means, medi- 
ans, and variances, it does not include other important statistics like the 
Wilcoxon statistic or some rank statistics where the parameter of interest is 
a function of the joint distribution $\theta(P, Q)$ and not just a simple
difference $\theta(P) - \theta(Q)$.

In our companion paper (Chung and Romano [1]), however, we consider
these statistics in a more general $U$-statistic framework. More
specifically, assume that $X_1, \ldots, X_m$ are i.i.d. $P$, and
independently, $Y_1, \ldots, Y_n$ are i.i.d. $Q$. The problem studied is to
test the null hypothesis

    $H_0 : E_{P,Q}(\varphi(X_1, \ldots, X_r, Y_1, \ldots, Y_r)) = 0,$

where the expectation can be estimated by its corresponding two-sample
$U$-statistic of the form

    $U_{m,n}(Z) = \frac{1}{\binom{m}{r}\binom{n}{r}} \sum_{\alpha} \sum_{\beta} \varphi(X_{\alpha_1}, \ldots, X_{\alpha_r}, Y_{\beta_1}, \ldots, Y_{\beta_r}),$

where $\alpha$ and $\beta$ range over the sets of all unordered subsets of
$r$ different elements chosen from $\{1, \ldots, m\}$ and of $r$ different
elements chosen from $\{1, \ldots, n\}$, respectively.

This general class of $U$-statistics covers, for example, Lehmann's
two-sample $U$-statistic to test $H_0 : P(|Y' - Y| > |X' - X|) = 1/2$, the
two-sample Wilcoxon statistic to test $H_0 : P(X < Y) = P(Y < X)$, and some
other interesting rank statistics. Under quite weak assumptions, we provide
a general theory whereby one can construct a permutation test of a
parameter $\theta(P, Q) = \theta_0$ which controls the asymptotic
probability of a type 1 error in large samples while retaining the
exactness property in finite samples when the underlying distributions are
identical. The technical arguments involved in this $U$-statistic problem
are different from Section 2, but the mathematical and statistical
foundations to be laid out in Section 5 provide fundamental ingredients
that aid our asymptotic derivations.

3.2. Robust k-sample test. From our general considerations, we are now
guided by the principle that the large sample distribution of the test
statistic should not depend on the underlying distributions; that is, it
should be asymptotically pivotal under the null. Of course, it can be
something other than normal, and we next consider the important problem of
testing equality of parameters of $k$ samples (where a limiting Chi-squared
distribution is obtained).

Assume we observe $k$ independent samples of i.i.d. observations.
Specifically, assume $X_{i,1}, \ldots, X_{i,n_i}$ are i.i.d. $P_i$. Some of
our results will hold for fixed $n_1, \ldots, n_k$, but we also have
asymptotic results as $N \equiv \sum_i n_i \to \infty$. Let
$n = (n_1, \ldots, n_k)$, and the notation $n \to \infty$ will mean
$\min_i n_i \to \infty$. Let $\theta(\cdot)$ be a real-valued parameter,
defined on some space of distributions $\mathcal{P}$. The problem of
interest is to test the null hypothesis

(3.1)  $H_0 : \theta(P_1) = \cdots = \theta(P_k)$

against the alternative

    $H_1 : \theta(P_i) \ne \theta(P_j)$ for some $i, j$.

When $P_1 = \cdots = P_k$ holds, one can construct permutation tests with
exact level $\alpha$. However, if $P_i \ne P_j$ for some $i, j$, then the
test may fail to achieve the rejection probability equal to $\alpha$ even
asymptotically.

We will assume that asymptotically linear estimators are available, that
is, (2.2) holds for i.i.d. samples under $P_i$ for $i = 1, \ldots, k$,
where $f_{P_i}$ can depend on the underlying distribution $P_i$. Further
assume that the expansion also holds for i.i.d. observations
$\bar Z_{i,1}, \ldots, \bar Z_{i,n_i}$ sampled from the mixture
distribution $\bar P \equiv \sum_{i=1}^k p_i P_i$, where $n_i/N \to p_i$.
Note that the asymptotic linearity conditions need not require any form of
differentiability (though of course, some form of differentiability is a
sufficient condition). We will argue that the asymptotic linearity
conditions under $P_i$ for $i = 1, \ldots, k$ and under $\bar P$ are
sufficient to derive the asymptotic behavior of the $k$-sample permutation
distribution based on $T_{n,1}$ (defined below), without having to impose
any differentiability conditions.

The goal here is to construct a method that retains the exact control of 
the probability of a type 1 error when the observations are i.i.d., but also 
asymptotically controls the probability of a type 1 error under very weak 
assumptions, specifically finite nonzero variances of the influence functions. 



Lemma 3.1. Consider the above set-up. Assume (2.2) holds for
$P_1, \ldots, P_k$ with
$0 < \sigma_i^2 = \sigma_i^2(f_{P_i}) = E_{P_i} f_{P_i}^2(X_{i,j}) < \infty$.
Assume $n_i \to \infty$ with $n_i/N \to p_i > 0$ for $i = 1, \ldots, k$. Let

(3.2)  $T_{n,0} = \sum_{i=1}^k \frac{n_i}{\sigma_i^2} \left[ \hat\theta_{n,i} - \frac{\sum_{j=1}^k n_j \hat\theta_{n,j}/\sigma_j^2}{\sum_{j=1}^k n_j/\sigma_j^2} \right]^2,$

where $\hat\theta_{n,i} = \hat\theta_{n,i}(X_{i,1}, \ldots, X_{i,n_i})$ and
$\sigma_i^2 = \sigma_i^2(f_{P_i}) = E_{P_i} f_{P_i}^2(X_{i,j})$. Further
assume that $\hat\sigma_{n,i} = \hat\sigma_{n,i}(X_{i,1}, \ldots, X_{i,n_i})$
is a consistent estimator of $\sigma_i = \sigma_i(f_{P_i})$ when
$X_{i,1}, \ldots, X_{i,n_i}$ are i.i.d. $P_i$, for $i = 1, \ldots, k$.
Define

(3.3)  $T_{n,1} = \sum_{i=1}^k \frac{n_i}{\hat\sigma_{n,i}^2} \left[ \hat\theta_{n,i} - \frac{\sum_{j=1}^k n_j \hat\theta_{n,j}/\hat\sigma_{n,j}^2}{\sum_{j=1}^k n_j/\hat\sigma_{n,j}^2} \right]^2.$

Then, under $H_0$, both $T_{n,0}$ and $T_{n,1}$ converge in distribution to
the Chi-squared distribution with $k - 1$ degrees of freedom.

Let $\hat R_{n,1}(\cdot)$ denote the permutation distribution corresponding
to $T_{n,1}$. In words, $T_{n,1}$ is recomputed over all permutations of
the data. Specifically, if we let

    $(Z_1, \ldots, Z_N) = (X_{1,1}, \ldots, X_{1,n_1}, X_{2,1}, \ldots, X_{2,n_2}, \ldots, X_{k,1}, \ldots, X_{k,n_k}),$

then $\hat R_{n,1}(t)$ is formally equal to the right-hand side of (1.2),
with $T_{m,n}$ replaced by $T_{n,1}$.
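
For the special case of comparing means, the statistic $T_{n,1}$ can be
computed as in the following Python sketch. The function name is
hypothetical, sample means and sample variances are plugged in as the
estimators (an assumption appropriate for means, where the influence
function is $x - \mu$), and each sample is assumed to have a positive
sample variance.

```python
from statistics import mean, variance


def k_sample_statistic(samples):
    """T_{n,1} of (3.3) with sample means and sample variances plugged in."""
    sizes = [len(s) for s in samples]
    estimates = [mean(s) for s in samples]
    variances = [variance(s) for s in samples]
    # Weight n_i / sigma_hat_i^2 attached to the i-th estimate.
    weights = [n_i / v_i for n_i, v_i in zip(sizes, variances)]
    # Precision-weighted average of the k estimates.
    pooled = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    return sum(w * (t - pooled) ** 2 for w, t in zip(weights, estimates))


if __name__ == "__main__":
    groups = [[1.0, 2.0, 3.0, 4.0], [0.0, 2.5, 5.0], [2.0, 3.0, 2.5, 2.5, 2.5]]
    print(k_sample_statistic(groups))
```

Recomputing this quantity over permutations of the pooled observations,
keeping the group sizes fixed, gives the permutation distribution
described above.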

Theorem 3.1. Assume the same setup and conditions of Lemma 3.1 with
$0 < \sigma_i^2 = \sigma_i^2(f_{P_i}) = E_{P_i} f_{P_i}^2(X_{i,j}) < \infty$.
Assume $n_i \to \infty$ with $n_i/N \to p_i > 0$. Further assume that the
consistency of $\hat\sigma_{n,i}$ for $\sigma_i$ under $P_i$ also holds
under $\bar P$, so that, when the $\bar Z_i$ are i.i.d. $\bar P$,

    $\hat\sigma_{n,i}(\bar Z_1, \ldots, \bar Z_{n_i}) \xrightarrow{P} \sigma(f_{\bar P})$  as $n \to \infty$

with $0 < \sigma^2(f_{\bar P}) < \infty$. Then, under $H_0$,

(3.4)  $\hat R_{n,1}(t) \xrightarrow{P} G_{k-1}(t),$

where $G_d$ denotes the Chi-squared distribution with $d$ degrees of
freedom. Moreover, if $P_1, \ldots, P_k$ satisfy $H_0$, then the
probability that the permutation test rejects $H_0$ tends to the nominal
level $\alpha$.

Example 3.1 (Nonparametric k-sample Behrens-Fisher problem). Consider
the special case where $\theta(P) = \mu(P)$ is the population mean. Also,
let $\hat\theta_{n,i}$ be the sample mean of the $i$th sample. When the
populations are assumed normal with possibly different unknown variances,
this is the classical Behrens-Fisher problem. Here, we do not assume
normality and provide a general solution for testing the equality of
parameters of several distributions. Indeed, we have exact finite sample
type 1 error control when all the populations are the same, and asymptotic
type 1 error control when the populations are possibly distinct. (Some
relatively recent large sample approaches to this specific problem, which
do not retain our finite sample exactness property, are given in Rice and
Gaines [21] and Krishnamoorthy, Lu and Mathew [15].)

4. Simulation results. Monte Carlo simulation studies illustrating our
results are presented in this section. Table 1 tabulates the rejection
probabilities of one-sided tests for the studentized permutation median
test where the nominal level considered is $\alpha = 0.05$. The simulation
results confirm that the studentized permutation median test is valid in
the sense that it approximately attains level $\alpha$ in large samples.

Table 1
Monte Carlo simulation results for studentized permutation median test
(one-sided, $\alpha = 0.05$)

                                     m:  5       13      51      101     101     201     401
Distributions      Test              n:  5       21      101     101     201     201     401

N(0,1)             Not studentized       0.1079  0.1524  0.1324  0.2309  0.2266  0.2266  0.2249
N(0,5)             Studentized           0.0802  0.1458  0.095   0.0615  0.0517  0.0517  0.0531

N(0,1)             Not studentized       0.0646  0.1871  0.2411  0.1769  0.1849  0.1849  0.1853
T(5)               Studentized           0.0707  0.1556  0.0904  0.0776  0.0661  0.0661  0.0611

Logistic(0,1)      Not studentized       0.0991  0.1413  0.1237  0.2258  0.2233  0.2233  0.2261
U(-10,10)          Studentized           0.0771  0.1249  0.0923  0.0686  0.0574  0.0574  0.0574

Laplace(ln 2, 1)   Not studentized       0.0420  0.0462  0.0477  0.048   0.0493  0.0461  0.0501
exp(1)             Studentized           0.0386  0.0422  0.0444  0.0502  0.0485  0.0505  0.0531

For simplicity, odd sample sizes are used in the Monte Carlo simulation.
We consider several pairs of distinct sample distributions that share the
same median, as listed in the first column of Table 1. For each situation,
10,000 simulations were performed. Within a given simulation, the
permutation test was calculated by randomly sampling 999 permutations.
Note that neither the exactness properties nor the asymptotic properties
are changed at all (as long as the number of permutations sampled tends to
infinity). For a discussion on stochastic approximations to the permutation
distribution, see the end of Section 15.2.1 in Lehmann and Romano [16] and
Section 4 in Romano [22]. As is well known, when the underlying
distributions of two distinct independent samples are not identical, the
permutation median test is not valid in the sense that the rejection
probability is far from the nominal level $\alpha = 0.05$. For example,
although a logistic distribution with location parameter 0 and scale
parameter 1 and a continuous uniform distribution with support ranging
from -10 to 10 have the same median of 0, the rejection probability for the
sample sizes examined is between 0.0991 and 0.2261 and moves further away
from the nominal level $\alpha = 0.05$ as sample sizes increase.
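
The sampling scheme described above (a fixed number of random permutations per data set, repeated over many simulated data sets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Table 1: the function names `median_diff`, `permutation_pvalue` and `rejection_rate` are hypothetical, the statistic shown is the unstudentized difference of medians, and the `(count + 1) / (n_perm + 1)` p-value convention is an assumption of this sketch. A studentized version would pass a different `stat`.

```python
import random
import statistics


def median_diff(x, y):
    """Unstudentized statistic: difference of sample medians."""
    return statistics.median(x) - statistics.median(y)


def permutation_pvalue(x, y, stat=median_diff, n_perm=999, rng=None):
    """One-sided Monte Carlo permutation p-value (large values reject)."""
    if rng is None:
        rng = random.Random()
    m = len(x)
    pooled = list(x) + list(y)
    observed = stat(x, y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # one random permutation of the pooled data
        if stat(pooled[:m], pooled[m:]) >= observed:
            count += 1
    # Assumed convention: the observed arrangement is counted as one
    # of the n_perm + 1 arrangements.
    return (count + 1) / (n_perm + 1)


def rejection_rate(draw_x, draw_y, m, n, n_sim=200, alpha=0.05, seed=0):
    """Fraction of simulated data sets on which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [draw_x(rng) for _ in range(m)]
        y = [draw_y(rng) for _ in range(n)]
        if permutation_pvalue(x, y, rng=rng) <= alpha:
            rejections += 1
    return rejections / n_sim
```

For instance, `rejection_rate(lambda r: r.gauss(0, 1), lambda r: r.gauss(0, 5), 51, 101)` would estimate the unstudentized rejection probability for one normal pair, here reading the second argument of `gauss` as a standard deviation, which is an assumption about the N(0,5) entry in the table.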

In contrast, the studentized permutation test results in rejection
probability that tends to the nominal level $\alpha$ asymptotically. We apply
the bootstrap method (Efron [5]) to estimate the variance for the median
$\frac{1}{4 f_P^2(\theta)}$ in the simulation, given by

$m \sum_{i=1}^{m} [X_{(i)} - \theta(\hat P_m)]^2 \cdot P(\theta(\hat P_m^*) = X_{(i)})$,

where for an odd number $m$,

$P(\theta(\hat P_m^*) = X_{(i)}) = P\left(\mathrm{Binomial}\left(m, \frac{i-1}{m}\right) \le \frac{m-1}{2}\right) - P\left(\mathrm{Binomial}\left(m, \frac{i}{m}\right) \le \frac{m-1}{2}\right)$.


As noted earlier, there exist other choices such as the kernel estimator and 
the smoothed bootstrap estimator. We emphasize, however, that using the 
bootstrap to obtain an estimate of standard error does not destroy the 
exactness of permutation tests under identical distributions. 
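
The bootstrap variance formula above can be evaluated exactly, without resampling, because the probability that the bootstrap median equals each order statistic is a difference of binomial c.d.f. values. The following sketch assumes an odd sample size and moderate $m$ (the direct binomial summation is not designed for very large $m$); the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k), by direct summation."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def bootstrap_median_weights(m):
    """P(bootstrap median = X_(i)) for i = 1, ..., m, with m odd."""
    if m % 2 == 0:
        raise ValueError("m must be odd")
    half = (m - 1) // 2
    return [
        binom_cdf(half, m, (i - 1) / m) - binom_cdf(half, m, i / m)
        for i in range(1, m + 1)
    ]


def bootstrap_median_variance(x):
    """m * sum_i [X_(i) - sample median]^2 * P(bootstrap median = X_(i))."""
    xs = sorted(x)
    m = len(xs)
    weights = bootstrap_median_weights(m)
    med = xs[(m - 1) // 2]  # sample median for odd m
    return m * sum(w * (xi - med) ** 2 for w, xi in zip(weights, xs))
```

The weights telescope to one, and for $m = 3$ the middle order statistic receives weight $13/27$, which gives a quick sanity check on the formula.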

5. Four technical ingredients. In this section, we discuss four separate 
ingredients, from which the main results flow. These results are separated 
out so they can easily be applied to other problems and so that the main 
technical arguments are highlighted. The first two apply more generally to 
randomization tests, not just permutation tests, and are stated as such. 

5.1. Hoeffding's condition. Suppose data $X^n$ has distribution $P_n$ in
$\mathcal{X}_n$, and $\mathbf{G}_n$ is a finite group of transformations $g$
of $\mathcal{X}_n$ onto itself. For a given statistic $T_n = T_n(X^n)$, let
$\hat R_n^T(\cdot)$ denote the randomization distribution of $T_n$, defined by

(5.1) $\hat R_n^T(t) = \frac{1}{|\mathbf{G}_n|} \sum_{g \in \mathbf{G}_n} I\{T_n(g X^n) \le t\}$,

where $|\mathbf{G}_n|$ denotes the cardinality of $\mathbf{G}_n$. Hoeffding [8]
gave a sufficient condition to derive the limiting behavior of
$\hat R_n^T(\cdot)$. This condition is verified repeatedly in the proofs, but
we add the result that the condition is also necessary.
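
For a very small data set, the randomization distribution (5.1) can be computed by brute force when the group is taken to be all permutations of the pooled observations. The sketch below is only meant to make the definition concrete; the names `randomization_cdf` and `mean_diff` are hypothetical, and full enumeration is feasible only for tiny $N$.

```python
from itertools import permutations


def mean_diff(x, y):
    """Example statistic: difference of sample means."""
    return sum(x) / len(x) - sum(y) / len(y)


def randomization_cdf(z, m, stat, t):
    """Evaluate (5.1) at t, with the group being all permutations of z.

    The first m permuted coordinates play the role of the first sample and
    the remaining coordinates the role of the second sample.
    """
    total = 0
    hits = 0
    for perm in permutations(z):
        total += 1
        if stat(perm[:m], perm[m:]) <= t:
            hits += 1
    return hits / total
```

With `z = (1.0, 2.0, 4.0, 7.0)` and `m = 2`, three of the six possible splits give a nonpositive mean difference, so `randomization_cdf(z, 2, mean_diff, 0.0)` evaluates to 0.5.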

Theorem 5.1. Let $G_n$ and $G_n'$ be independent and uniformly distributed
over $\mathbf{G}_n$ (and independent of $X^n$). Suppose, under $P_n$,

(5.2) $(T_n(G_n X^n), T_n(G_n' X^n)) \xrightarrow{d} (T, T')$,

where $T$ and $T'$ are independent, each with common c.d.f. $R^T(\cdot)$.
Then, for all continuity points $t$ of $R^T(\cdot)$,

(5.3) $\hat R_n^T(t) \xrightarrow{P} R^T(t)$.

Conversely, if (5.3) holds for some limiting c.d.f. $R^T(\cdot)$ whenever $t$
is a continuity point, then (5.2) holds.

The reason we think it is important to add the necessity part of the result
is that our methodology is somewhat different from that of other authors
mentioned in the Introduction, who take a more conditional approach to
proving limit theorems. After all, the permutation distribution is indeed
a distribution conditional on the observed set of observations (without
regard to ordering). However, the theorem shows that a sufficient condition
is obtained by verifying an unconditional weak convergence property.
Nevertheless, simple arguments (see the supplementary appendix [2]) show the
condition is indeed necessary and so taking such an approach is not fanciful.

5.2. Slutsky's theorem for randomization distributions. Consider the
general setup of Section 5.1. The result below describes Slutsky's theorem in
the context of randomization distributions. In this context, the
randomization distributions are random themselves, and therefore the usual
Slutsky's theorem does not quite apply. Because of its utility in the proofs
of our main results, we highlight the statement. Given sequences of statistics
$T_n$, $A_n$ and $B_n$, let $\hat R_n^{AT+B}(\cdot)$ denote the randomization
distribution corresponding to the statistic sequence $A_n T_n + B_n$; that is,
replace $T_n$ in (5.1) by $A_n T_n + B_n$, so

(5.4) $\hat R_n^{AT+B}(t) = \frac{1}{|\mathbf{G}_n|} \sum_{g \in \mathbf{G}_n} I\{A_n(g X^n) T_n(g X^n) + B_n(g X^n) \le t\}$.

Theorem 5.2. Let $G_n$ and $G_n'$ be independent and uniformly distributed
over $\mathbf{G}_n$ (and independent of $X^n$). Assume $T_n$ satisfies (5.2).
Also, assume

(5.5) $A_n(G_n X^n) \xrightarrow{P} a$

and

(5.6) $B_n(G_n X^n) \xrightarrow{P} b$

for constants $a$ and $b$. Let $R^{aT+b}(\cdot)$ denote the distribution of
$aT + b$, where $T$ is the limiting random variable assumed in (5.2). Then

$\hat R_n^{AT+B}(t) \xrightarrow{P} R^{aT+b}(t)$,

if the distribution $R^{aT+b}(\cdot)$ of $aT + b$ is continuous at $t$.
[Of course, $R^{aT+b}(t) = R^T(\frac{t-b}{a})$ if $a \ne 0$.]

5.3. A coupling construction. Consider the general situation where $k$
samples are observed from possibly different distributions. Specifically,
assume for $i = 1, \ldots, k$ that $X_{i,1}, \ldots, X_{i,n_i}$ is a sample of
$n_i$ i.i.d. observations from $P_i$. All $N = \sum_i n_i$ observations are
mutually independent. Put all the observations together in one vector

$Z = (X_{1,1}, \ldots, X_{1,n_1}, X_{2,1}, \ldots, X_{2,n_2}, \ldots, X_{k,1}, \ldots, X_{k,n_k})$.

The basic intuition driving the results concerning the behavior of the
permutation distribution stems from the following. Since the permutation
distribution considers the empirical distribution of a statistic evaluated at
all permutations of the data, it clearly does not depend on the ordering of
the observations. Let $n_i/N$ denote the proportion of observations in the
$i$th sample, and let $p_i = \lim_{n_i \to \infty} n_i/N \in (0, 1)$. Assume
that $n_i \to \infty$ in such a way that

(5.7) $p_i - \frac{n_i}{N} = O(N^{-1/2})$.

Then the permutation distribution based on $Z$ should behave approximately
like the permutation distribution based on a sample of $N$ i.i.d.
observations $\bar Z = (\bar Z_1, \ldots, \bar Z_N)$ from the mixture
distribution $\bar P = p_1 P_1 + \cdots + p_k P_k$. Of course, we can think of
the $N$ observations generated from $\bar P$ as arising out of a two-stage
process: for $i = 1, \ldots, N$, first draw an index $j$ at random with
probability $p_j$; then, conditional on the outcome being $j$, sample
$\bar Z_i$ from $P_j$. However, aside from the fact that the ordering of the
observations in $Z$ is clearly that of $n_1$ observations from $P_1$,
followed by $n_2$ observations from $P_2$, etc., the original sampling scheme
is still only approximately like that of sampling from $\bar P$. For example,
the number of observations $\bar Z_i$ out of the $N$ which are from $P_1$ is
binomial with parameters $N$ and $p_1$ (and so has mean equal to
$p_1 N \approx n_1$), while the number of observations from $P_1$ in the
original sample $Z$ is exactly $n_1$.

Along the same lines, let $\pi = (\pi(1), \ldots, \pi(N))$ denote a random
permutation of $\{1, \ldots, N\}$. Then, if we consider a random permutation
of both $Z$ and $\bar Z$, the number of observations in the first $n_1$
coordinates of $Z$ which were $X$s has the hypergeometric distribution, while
the number of observations in the first $n_1$ coordinates of $\bar Z$ which
were $X$s is still binomial.

We can make a more precise statement by constructing a certain coupling
of $Z$ and $\bar Z$. That is, except for ordering, we can construct $\bar Z$
to include almost the same set of observations as in $Z$. The simple idea goes
as follows. Given $Z$, we will construct observations
$\bar Z_1, \ldots, \bar Z_N$ via the two-stage process as above, using the
observations drawn to make up the $Z_i$ as much as possible. First, draw an
index $j$ among $\{1, \ldots, k\}$ at random with probability $p_j$; then,
conditionally on the outcome being $j$, set $\bar Z_1 = X_{j,1}$. Next, if the
next index $i$ drawn among $\{1, \ldots, k\}$ at random with probability $p_i$
is different from the $j$ from which $\bar Z_1$ was sampled, then
$\bar Z_2 = X_{i,1}$; otherwise, if $i = j$ as in the first step, set
$\bar Z_2 = X_{j,2}$. In other words, we are going to continue to use the
$Z_i$ to fill in the observations $\bar Z_i$. However, after a certain point,
we will get stuck because we will have already exhausted all the $n_j$
observations from the $j$th population governed by $P_j$. If this happens and
an index $j$ was drawn again, then just sample a new observation
$X_{j,n_j+1}$ from $P_j$. Continue in this manner so that as many as possible
of the original $Z_i$ observations are used in the construction of $\bar Z$.
Now, we have both $Z$ and $\bar Z$. At this point, $Z$ and $\bar Z$ have many
of the same observations in common. The number of observations which differ,
say $D$, is the (random) number of added observations required to fill up
$\bar Z$. (Note that we are obviously using the word "differ" here to mean the
observations are generated from different mechanisms, though in fact there may
be a positive probability that the observations still are equal if the
underlying distributions have atoms. Still, we count such observations as
differing.)

Moreover, we can reorder the observations in $\bar Z$ by a permutation
$\pi_0$ so that $Z_i$ and $\bar Z_{\pi_0(i)}$ agree for all $i$ except for
some hopefully small (random) number $D$. To do this, recall that $Z$ has the
observations in order, that is, the first $n_1$ observations arose from $P_1$
and the next set of $n_2$ observations came from $P_2$, etc. Thus, to couple
$Z$ and $\bar Z$, simply put all the observations in $\bar Z$ which came from
$P_1$ first, up to $n_1$. That is, if the number of observations in $\bar Z$
from $P_1$ is greater than or equal to $n_1$, then $\bar Z_{\pi_0(i)}$ for
$i = 1, \ldots, n_1$ are filled with the observations in $\bar Z$ which came
from $P_1$, and if the number was strictly greater than $n_1$, put the extra
ones aside for now. On the other hand, if the number of observations in
$\bar Z$ which came from $P_1$ is less than $n_1$, fill up as many of the
slots as possible with the observations in $\bar Z$ from $P_1$, and leave the
rest of the slots among the first $n_1$ spots blank for now. Next, move on to
the observations in $\bar Z$ which came from $P_2$ and repeat the above
procedure for the spots $n_1 + 1, \ldots, n_1 + n_2$; that is, starting from
spot $n_1 + 1$, we fill in as many of the observations in $\bar Z$ which came
from $P_2$ as possible, up to $n_2$ of them. After going through all the
distributions $P_i$ from which each of the observations in $\bar Z$ came, one
must then complete the observations in $\bar Z_{\pi_0}$; simply "fill up" the
empty spots with the remaining observations that have been put aside. (At this
point, it does not matter where each of the remaining observations gets
inserted; but, to be concrete, fill the empty slots by inserting the
observations which came from the index $P_i$ in chronological order from when
constructed.) This permuting of observations in $\bar Z$ corresponds to a
permutation $\pi_0$ and satisfies $Z_i = \bar Z_{\pi_0(i)}$ for indices $i$
except for $D$ of them.
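
Because the coupling reuses the original observations until a population is exhausted, the number of differing observations is $D = \sum_j \max(N_j - n_j, 0)$, where $N_j$ is the number of mixture draws with label $j$. The sketch below simulates only the labels, which is all that $D$ depends on. It assumes $p_j = n_j/N$ exactly (a special case of the rate condition (5.7)), and the function names are hypothetical.

```python
import random


def coupling_mismatch_count(sizes, rng):
    """Draw N mixture labels with p_j = n_j / N and return D.

    D is the number of new observations that must be sampled because
    population j was drawn more than n_j times.
    """
    total = sum(sizes)
    labels = rng.choices(range(len(sizes)), weights=sizes, k=total)
    counts = [0] * len(sizes)
    for j in labels:
        counts[j] += 1
    return sum(max(c - n, 0) for c, n in zip(counts, sizes))


def mean_mismatch_fraction(sizes, n_rep=2000, seed=0):
    """Monte Carlo estimate of E(D / N)."""
    rng = random.Random(seed)
    total = sum(sizes)
    d_sum = sum(coupling_mismatch_count(sizes, rng) for _ in range(n_rep))
    return d_sum / (n_rep * total)
```

Comparing `mean_mismatch_fraction([100, 100])` with `200 ** -0.5` gives a rough numerical feel for how small the mismatched fraction is relative to $N^{-1/2}$.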

For example, suppose there are $k = 2$ populations. Suppose that $N_1$ of
the $\bar Z$ observations came from $P_1$ and so $N - N_1$ from $P_2$. Of
course, $N_1$ is random and has the binomial distribution with parameters $N$
and $p_1$. If $N_1 \ge n_1$, then the above construction yields that the first
$n_1$ observations in $Z$ and $\bar Z_{\pi_0}$ completely agree. Furthermore,
if $N_1 > n_1$, then the number of observations in $\bar Z$ from $P_2$ is
$N - N_1 < N - n_1 = n_2$, and $N - N_1$ of the last $n_2$ indices in $Z$
match those of $\bar Z_{\pi_0}$, while the remaining ones differ. In this
situation, we have

$Z = (X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2})$

and

$\bar Z_{\pi_0} = (X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{N-N_1}, X_{n_1+1}, \ldots, X_{N_1})$,

so that $Z$ and $\bar Z_{\pi_0}$ differ only in the last $N_1 - n_1$ places.
In the opposite situation where $N_1 < n_1$, $Z$ and $\bar Z_{\pi_0}$ are
equal in the first $N_1$ and last $n_2$ places, only differing in spots
$N_1 + 1, \ldots, n_1$.

The number of observations $D$ where $Z$ and $\bar Z_{\pi_0}$ differ is
random and it can be shown that

(5.8) $E(D/N) \le N^{-1/2}$;

see supplementary appendix [2]. In summary, the coupling construction
shows that only a small fraction of the $N$ observations in $Z$ and
$\bar Z_{\pi_0}$ differ with high probability. Therefore, if the
randomization distribution is based on a statistic $T_N(Z)$ such that the
difference $T_N(Z) - T_N(\bar Z_{\pi_0})$ is small in some sense whenever $Z$
and $\bar Z_{\pi_0}$ mostly agree, then one should be able to deduce the
behavior of the permutation distribution under samples from
$P_1, \ldots, P_k$ from the behavior of the permutation distribution when all
$N$ observations come from the same distribution $\bar P$. Whether or not this
can be done requires some knowledge of the form of the statistic, but
intuitively it should hold if the statistic cannot be strongly affected by a
change in a small proportion of the observations. Although its validity must
be established on a case by case basis, the argument readily extends to a
broad class of statistics such as "mean-like" statistics. (Indeed, this
coupling argument and the contiguity results in Section 5.4 together allow us
to prove quite general results.) The point is that it is a worthwhile and
beneficial route to pursue because the behavior of the permutation
distribution under $N$ i.i.d. observations is typically much easier to analyze
than under the more general setting when observations have possibly different
distributions. Furthermore, the behavior under i.i.d. observations seems
fundamental as this is the requirement for the "randomization hypothesis" to
hold, that is, the requirement to yield exact finite sample inference.

To be more specific, suppose $\pi$ and $\pi'$ are independent random
permutations, and independent of the $Z_i$ and $\bar Z_i$. Suppose we can
show that

(5.9) $(T_N(\bar Z_\pi), T_N(\bar Z_{\pi'})) \xrightarrow{d} (T, T')$,

where $T$ and $T'$ are independent with common c.d.f. $R(\cdot)$. Then, by
Theorem 5.1, the randomization distribution based on $T_N$ converges in
probability to $R(\cdot)$ when all observations are i.i.d. according to
$\bar P$. But since $\pi\pi_0$ (meaning $\pi$ composed with $\pi_0$, so
$\pi_0$ is applied first) and $\pi'\pi_0$ are also independent random
permutations, (5.9) also implies

$(T_N(\bar Z_{\pi\pi_0}), T_N(\bar Z_{\pi'\pi_0})) \xrightarrow{d} (T, T')$.

Using the coupling construction to construct $\bar Z$, suppose it can be shown
that

(5.10) $T_N(\bar Z_{\pi\pi_0}) - T_N(Z_\pi) \xrightarrow{P} 0$.

Then, it also follows that

$T_N(\bar Z_{\pi'\pi_0}) - T_N(Z_{\pi'}) \xrightarrow{P} 0$,

and so by Slutsky's theorem, it follows that

(5.11) $(T_N(Z_\pi), T_N(Z_{\pi'})) \xrightarrow{d} (T, T')$.

Therefore, again by Theorem 5.1, the randomization distribution also
converges in probability to $R(\cdot)$ under the original model of $k$ samples
from possibly different distributions. In summary, the coupling construction
of $\bar Z$, $Z$ and $\pi_0$ and the one added requirement (5.10) allow us to
reduce the study of the permutation distribution under possibly $k$ different
distributions to the i.i.d. case when all $N$ observations are i.i.d.
according to $\bar P$. We summarize this as follows.

Lemma 5.1. Assume (5.9) and (5.10). Then (5.11) holds, and so the
permutation distribution based on $k$ samples from possibly different
distributions behaves asymptotically as if all observations are i.i.d. from
the mixture distribution $\bar P$ and satisfies

$\hat R_N^T(t) \xrightarrow{P} R(t)$,

if $t$ is a continuity point of the distribution $R$ of $T$ in (5.9).



Example 5.1 (Difference of sample means). To appreciate what is involved
in the verification of (5.10), consider the two-sample problem considered in
Theorem 2.1, in the special case of testing equality of means. The unknown
variances may differ and are assumed finite. Consider the test statistic
$T_{m,n} = N^{1/2}[\bar X_m - \bar Y_n]$. By the coupling construction,
$\bar Z_{\pi\pi_0}$ and $Z_\pi$ have the same components except for at most
$D$ places. Now,

$T_{m,n}(\bar Z_{\pi\pi_0}) - T_{m,n}(Z_\pi) = \frac{N^{1/2}}{m} \sum_{i=1}^{m} (\bar Z_{\pi\pi_0(i)} - Z_{\pi(i)}) - \frac{N^{1/2}}{n} \sum_{j=m+1}^{N} (\bar Z_{\pi\pi_0(j)} - Z_{\pi(j)})$.

All of the terms in the above two sums are zero except for at most $D$ of
them. But any nonzero term like $\bar Z_{\pi\pi_0(i)} - Z_{\pi(i)}$ has
variance bounded above by

$2\max(\mathrm{Var}(X_1), \mathrm{Var}(Y_1)) < \infty$.

Note the above random variable has mean zero under the null hypothesis
that $E(X_i) = E(Y_j)$. To bound its variance, condition on $D$ and $\pi$, and
note it has conditional mean 0 and conditional variance bounded above by

$N \frac{1}{\min(m^2, n^2)} 2\max(\mathrm{Var}(X_1), \mathrm{Var}(Y_1)) D$

and hence unconditional variance bounded above by

$N \frac{1}{\min(m^2, n^2)} 2\max(\mathrm{Var}(X_1), \mathrm{Var}(Y_1)) O(N^{1/2}) = O(N^{-1/2}) = o(1)$,

implying (5.10). In words, we have shown that the behavior of the
permutation distribution can be deduced from the behavior of the permutation
distribution when all observations are i.i.d. with mixture distribution
$\bar P$.

Two final points are relevant. First, the limiting distribution $R$ is
typically the same as the limiting distribution of the true unconditional
distribution of $T_N$ under $\bar P$. This is intuitively the case because the
permutation distribution is invariant under any permutation of the combined
data, and so the set of $N$ observations with exactly $n_i$ observations
sampled from $P_i$, once randomly permuted, behaves very nearly the same as a
sample of $N$ observations from $\bar P$. On the other hand, the true limiting
distribution of the test statistic under $(P_1, \ldots, P_k)$ need not be the
same as under $\bar P$, as it will in general depend on the underlying
distributions $P_1, \ldots, P_k$. However, suppose the choice of test
statistic $T_N$ is such that it is an asymptotic pivot in the sense that its
limiting distribution does not depend on the underlying probability
distributions. Then, the limiting distribution of the test statistic will be
the same whether sampling from $(P_1, \ldots, P_k)$ or
$(\bar P, \ldots, \bar P)$. In such cases, the randomization or permutation
distribution under $(P_1, \ldots, P_k)$ will asymptotically reflect the true
unconditional distribution of $T_N$, resulting in asymptotically valid
inference. Indeed, the general results in Section 2 yield many examples of
this phenomenon. However, that these statements need qualification is made
clear by the following two (somewhat contrived) examples.

Example 5.2. Here, we illustrate a situation where coupling works,
but the true sampling distribution does not behave like the permutation
distribution under the mixture model $\bar P$. In the two-sample setup with
$m = n$, suppose $X_1, \ldots, X_n$ are i.i.d. uniform on the set of $x$ where
$|x| < 1$, and $Y_1, \ldots, Y_n$ are i.i.d. uniform on the set of $y$ with
$2 < |y| < 3$. So, $E(X_i) = E(Y_j) = 0$. Consider a test statistic $T_{n,n}$
defined as

$T_{n,n}(X_1, \ldots, X_n, Y_1, \ldots, Y_n) = N^{-1/2} \sum_{i=1}^{n} [I\{|Y_i| > 2\} - I\{|X_i| < 2\}]$.

Under the true sampling scheme, $T_{n,n}$ is zero with probability one.
However, if all $2n$ observations are sampled from the mixture model, it is
easy to see that $T_{n,n}$ is asymptotically normal $N(0, 1/4)$, which is the
same limit for the permutation distribution (in probability). So here, the
permutation distribution under the given distributions is the same as under
$\bar P$, though it does not reflect the actual true unconditional sampling
distribution.
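
The contrast in Example 5.2 is easy to check numerically. The sketch below simulates the statistic under the true sampling scheme and under i.i.d. sampling from the equal-weight mixture (here $p_1 = p_2 = 1/2$ since $m = n$); the function names, the replication counts and the seed are assumptions made for illustration only.

```python
import random


def draw_x(rng):
    """Uniform on {x : |x| < 1}."""
    return rng.uniform(-1.0, 1.0)


def draw_y(rng):
    """Uniform on {y : 2 < |y| < 3}."""
    return rng.choice((-1.0, 1.0)) * rng.uniform(2.0, 3.0)


def draw_mixture(rng):
    """Equal-weight mixture of the two distributions above."""
    return draw_x(rng) if rng.random() < 0.5 else draw_y(rng)


def statistic(xs, ys):
    """N^{-1/2} * sum_i [ I{|Y_i| > 2} - I{|X_i| < 2} ], with N = 2n."""
    n = len(xs)
    total = sum((abs(y) > 2) - (abs(x) < 2) for x, y in zip(xs, ys))
    return total / (2 * n) ** 0.5


def simulate(n, n_rep, seed=0):
    """Return statistic values under true sampling and under the mixture."""
    rng = random.Random(seed)
    true_vals, mix_vals = [], []
    for _ in range(n_rep):
        xs = [draw_x(rng) for _ in range(n)]
        ys = [draw_y(rng) for _ in range(n)]
        true_vals.append(statistic(xs, ys))
        pooled = [draw_mixture(rng) for _ in range(2 * n)]
        mix_vals.append(statistic(pooled[:n], pooled[n:]))
    return true_vals, mix_vals
```

Under the true scheme every indicator difference is zero, so the first list is identically zero, while the empirical variance of the second list should be close to $1/4$.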

Example 5.3. Here, we consider a situation where both populations
are indeed identical, so there is no need for a coupling argument. However,
the point is that the permutation distribution does not behave like the true
unconditional sampling distribution. Assume $X_1, \ldots, X_n$ and
$Y_1, \ldots, Y_n$ are all i.i.d. $N(0, 1)$ and consider the test statistic

$T_{n,n}(X_1, \ldots, X_n, Y_1, \ldots, Y_n) = N^{-1/2} \sum_{i=1}^{n} (X_i + Y_i)$.

Unconditionally, $T_{n,n}$ converges in distribution to $N(0, 1)$. However,
the permutation distribution places mass one at
$\sqrt{n/2}\,(\bar X_n + \bar Y_n)$ because the statistic $T_{n,n}$ is
permutation invariant.

Examples 5.2 and 5.3 show that the intuition provided in the paragraph 
before Example 5.2 does not always work. However, in the two examples, 
the test statistic does not reflect an actual comparison between P and Q. Of 
course, our theorems apply to tests of equality of parameters, and therefore 
the test statistics are based on appropriate differences. 

5.4. An auxiliary contiguity result. Consider the general situation
involving $k$ (possibly distinct) populations for $i = 1, \ldots, k$ with
$n_i$ observations from population $i$. Set $N = \sum_{i=1}^{k} n_i$ and
$n = (n_1, \ldots, n_k)'$, where the notation $n \to \infty$ means
$\min_i n_i \to \infty$. Assume all $N$ observations are mutually independent.
Define $p_{n,i} = n_i/N \to p_i \in (0, 1)$ as $n \to \infty$ for
$i = 1, \ldots, k$. Let $P_n$ be the multinomial distribution based on
parameters $s = s(n)$ and $p_n = (p_{n,1}, \ldots, p_{n,k})$. So, under $P_n$,
let $M_{n,i}$ be the number of observations of type $i$ when $s$ observations
are taken with replacement from a population with $n_i$ observations of type
$i$. So, $M_n = (M_{n,1}, \ldots, M_{n,k}) \sim P_n$. Also, let $Q_n$ be the
multivariate hypergeometric distribution. Under $Q_n$, let $H_{n,i}$ be the
number of observations of type $i$ when $s$ observations are taken without
replacement. So, $H_n = (H_{n,1}, \ldots, H_{n,k}) \sim Q_n$.

We shall show that the multinomial distribution $P_n$ and the multivariate
hypergeometric distribution $Q_n$ are mutually contiguous, which will allow
us to obtain the limiting behavior of a statistic under the given samples
from $k$ probability distributions $P_i$ for $i = 1, \ldots, k$, by instead
calculating the limiting behavior of the statistic when all $N$ observations
are i.i.d. from the mixture distribution $\bar P = \sum_{i=1}^{k} p_i P_i$,
which is relatively easier to obtain. For basic details on contiguity, see
Section 12.3 in Lehmann and Romano [16].

Lemma 5.2. Assume the above setup with $s/N \to \theta \in [0, 1)$ as
$n \to \infty$. Consider the likelihood ratio $L_n(x) = dQ_n(x)/dP_n(x)$.

(i) The limiting distribution of $L_n(M_n)$ satisfies

(5.12) $L_n(M_n) \xrightarrow{d} (1 - \theta)^{-(k-1)/2} \exp\left(-\frac{\theta}{2(1-\theta)} \chi^2_{k-1}\right)$,

where $\chi^2_{k-1}$ denotes the Chi-squared distribution with $k - 1$
degrees of freedom.

(ii) $Q_n$ and $P_n$ are mutually contiguous.
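
The likelihood ratio in Lemma 5.2 has a closed form, since both the multivariate hypergeometric and the multinomial probability mass functions are products of factorial terms. The sketch below evaluates $\log L_n(x)$ with `math.lgamma`; it assumes $0 \le x_i \le n_i$ for every $i$ and uses $p_{n,i} = n_i/N$ as in the setup above. The function names are hypothetical.

```python
from math import exp, lgamma, log


def log_binom(n, k):
    """log of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_likelihood_ratio(x, sizes):
    """log of dQ_n/dP_n at counts x, given population sizes n_1, ..., n_k."""
    big_n = sum(sizes)
    s = sum(x)
    # Multivariate hypergeometric pmf: prod_i C(n_i, x_i) / C(N, s).
    log_q = sum(log_binom(n, xi) for n, xi in zip(sizes, x))
    log_q -= log_binom(big_n, s)
    # Multinomial pmf: s! / prod_i x_i! * prod_i (n_i / N)^{x_i}.
    log_p = lgamma(s + 1) - sum(lgamma(xi + 1) for xi in x)
    log_p += sum(xi * log(n / big_n) for n, xi in zip(sizes, x))
    return log_q - log_p
```

As a rough consistency check with (5.12), take $k = 2$, $n_1 = n_2 = 1000$ and $s = 1000$, so $\theta = 1/2$. At the central point $x = (500, 500)$, `exp(log_likelihood_ratio([500, 500], [1000, 1000]))` is close to $(1 - \theta)^{-1/2} = \sqrt{2}$, the value of the limit when the Chi-squared variable is zero.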

Remark 5.1. With $M_n = (M_{n,1}, \ldots, M_{n,k})$ having the multinomial
distribution with parameters $s$ and $p_n = (p_{n,1}, \ldots, p_{n,k})$ as in
Lemma 5.2, also let $\bar M_n = (\bar M_{n,1}, \ldots, \bar M_{n,k})$ have the
multinomial distribution with parameters $s$ and $p = (p_1, \ldots, p_k)$.
Then, the distributions of $M_n$ and $\bar M_n$ are contiguous if and only if
$p_{n,i} - p_i = O(n_i^{-1/2})$, not just $p_{n,i} \to p_i$, for all
$i = 1, \ldots, k$.

Lemma 5.3. Suppose $V_1, \ldots, V_s$ are i.i.d. according to the mixture
distribution

$\bar P = \sum_{i=1}^{k} p_i P_i$,

where $p_i \in (0, 1)$, $\sum_{i=1}^{k} p_i = 1$ and the $P_i$'s are
probability distributions on some general space. Assume, for some sequence
$W_n$ of statistics,

(5.13) $W_n(V_1, \ldots, V_s) \xrightarrow{P} t$

for some constant $t$ (which can depend on the $P_i$'s and $p_i$'s). Let
$n_i \to \infty$, $s(n) \to \infty$, with $s/N \to \theta \in [0, 1)$,
$N = \sum_{i=1}^{k} n_i$, $p_{n,i} = n_i/N$, and $p_{n,i} \to p_i \in (0, 1)$
with

(5.14) $p_{n,i} - p_i = O(n_i^{-1/2})$.

Further, let $X_{i,1}, \ldots, X_{i,n_i}$ be i.i.d. $P_i$ for
$i = 1, \ldots, k$. Let

$(Z_1, \ldots, Z_N) = (X_{1,1}, \ldots, X_{1,n_1}, \ldots, X_{k,1}, \ldots, X_{k,n_k})$.

Let $(\pi(1), \ldots, \pi(N))$ denote a random permutation of
$\{1, \ldots, N\}$ (and independent of all other variables). Then,

(5.15) $W_n(Z_{\pi(1)}, \ldots, Z_{\pi(s)}) \xrightarrow{P} t$.

Remark 5.2. The importance of Lemma 5.3 is that it allows us to
deduce the behavior of the statistic $W_n$ under the randomization or
permutation distribution from the basic assumption of how $W_n$ behaves under
i.i.d. observations from the mixture distribution $\bar P$. Note that in
(5.13), the convergence in probability assumption is required when the $V_i$
are i.i.d. $\bar P$ (so the $P$ over the arrow is just a generic symbol for
convergence in probability).

6. Conclusion. When the fundamental assumption of identical distributions
need not hold, two-sample permutation tests are invalid unless quite
stringent conditions are satisfied depending on the precise nature of the
problem. For example, the two-sample permutation test based on the difference
of sample means is asymptotically valid only when either the distributions
have the same variance or they are comparable in sample size. Thus, a careful
interpretation of rejecting the null is necessary; rejecting the null based
on the permutation tests does not necessarily imply a valid rejection of the
null that some real-valued parameter $\theta(F, G)$ is some specified value
$\theta_0$. We provide a framework that allows one to obtain asymptotic
rejection probability $\alpha$ in two-sample permutation tests. One great
advantage of utilizing the proposed test is that it retains the exactness
property in finite samples when $P = Q$, a desirable property that bootstrap
and subsampling methods fail to possess.

To summarize, if the true goal is to test whether the parameter of interest
$\theta$ is some specified value $\theta_0$, a permutation test based on a
correctly studentized statistic is an attractive choice. When testing the
equality of means, for example, the permutation t-test based on a studentized
statistic obtains asymptotic rejection probability $\alpha$ in general while
attaining exact rejection probability equal to $\alpha$ when $P = Q$. In the
case of testing the equality of medians, the studentized permutation median
test yields the same desirable property. Moreover, the results extend to quite
general settings based on asymptotically linear estimators. The results extend
to $k$-sample problems as well, and analogous results hold in the $k$-sample
problem of comparing general parameters, which includes the nonparametric
$k$-sample Behrens-Fisher problem. The guiding principle is to use a test
statistic that is asymptotically distribution-free or pivotal. Then, the
technical arguments developed in this paper can be used to show that the
permutation test behaves asymptotically the same as when all observations
share a common distribution. Consequently, if the permutation distribution
reflects the true underlying sampling distribution, asymptotic justification
is achieved.

As mentioned in the Introduction, proper implementation of a permutation
test is vital if one cares about confirmatory inference through hypothesis
testing; indeed, proper error control of types 1, 2 and 3 errors can be
obtained for tests of parameters by basing inference on test statistics which
are asymptotically pivotal. Thus, the foundations are laid for considering
more complex problems in modern data analysis, such as two-sample microarray
genomics problems, where a very large number of tests are performed
simultaneously. (Indeed, there are many microarray analyses which have begun
by performing a permutation test for each gene, without proper
studentization.) The role of permutations in multiple testing cannot be
properly understood without a firm basis for single testing. Thus, future work
will further develop the ideas presented here so that permutation tests can be
applied to other measures of error control in multiple testing such as the
false discovery rate.

SUPPLEMENTARY MATERIAL 

Supplement to "Exact and asymptotically robust permutation tests" (DOI: 
10.1214/13-AOS1090SUPP; .pdf). Contains proofs of all the results in the 
paper.